1. Abstract

Bicycling is an activity which yields many benefits: riders improve their health through exercise, traffic congestion is reduced when riders move out of cars, and carbon emissions fall correspondingly. In recent years, bike sharing has become popular in a growing list of cities around the world. The NYC “CitiBike” bicycle sharing scheme went live (in midtown and downtown Manhattan) in 2013 and has been expanding ever since, both in daily ridership and in geographic footprint, as a growing number of “docking stations” welcome riders in Brooklyn, Queens, and northern parts of Manhattan which were not previously served.

One problem that many bikeshare systems face is money. An increase in the number of riders who want to use the system necessitates that more bikes be purchased and put into service in order to accommodate them. Heavy ridership induces wear on the bikes, requiring more frequent repairs. However, an increase in the number of trips does not necessarily translate to an increase in revenue, because clever riders can avoid paying surcharges by keeping the length of each trip below a specified limit (either 30 or 45 minutes, depending on user category).

We examine CitiBike ridership data, joined with daily NYC weather data, to study the impact of weather on shared bike usage and to build a predictive model which estimates the number of trips taken each day. The goal is to project future demand, enabling the system operator to make expansion plans.

Our finding is that ridership exhibits strong seasonality, with correlation to weather-related variables such as daily temperature and precipitation. Additionally, ridership is segmented by user_type (annual subscribers use the system much more heavily than casual users), by gender (there are many more male users than female), and by age (a large number of users are clustered in their late 30s).

Keywords

Bikeshare, Weather, Cycling, CitiBike, New York City

2. Introduction

Since 2013 a shared bicycle system known as CitiBike has been available in New York City. The benefits to having such a system include reducing New Yorkers’ dependence on automobiles and encouraging public health through the exercise attained by cycling. Additionally, users who would otherwise spend money on public transit may find bicycling more economical – so long as they are aware of CitiBike’s pricing constraints.

There are currently about 12,000 shared bikes which users can rent from about 750 docking stations located in Manhattan and in western portions of Brooklyn and Queens. A rider can pick up a bike at one station and return it at a different station. The system has been expanding each year, with increases in the number of bicycles available and expansion of the geographic footprint of docking stations. For planning purposes, the system operator needs to project future ridership in order to make good investments.

The available usage data provides a wealth of information which can be mined to seek trends in usage. With such intelligence, the company would be better positioned to determine what actions might optimize its revenue stream.

The rest of the paper proceeds as follows:

3. Literature review

Westland et al. examined consumer behavior in bike sharing in Beijing using a deep-learning model incorporating weather and air quality, time-series of demand, and geographical location; later adding customer segmentation. [@Westland_Mou_Yin_2019]

Jia et al. performed a retrospective study of dockless bike sharing in Shanghai to determine whether the introduction of such a program increased cycling. Their methodology was to survey people in neighborhoods selected by sampling, interviewing individuals on the street. [@Jia_Ding_Gebel_Chen_Zhang_Ma_Fu_2019]

Jia and Fu further examined whether dockless bicycle-sharing programs promote changes in travel mode in commuting and non-commuting trips, as well as the association between change in travel mode and potential correlates, as part of the same Shanghai study. [@Jia_Fu_2019]

Dell’Amico et al. modeled bike sharing rebalancing programs initially in Reggio Emilia, Italy using branch-and-cut algorithms. [@DellAmico_Hadjicostantinou_Iori_Novellani_2014]

In a more recent paper, Dell’Amico et al. examined the bike-sharing rebalancing problem with Stochastic Demands, aimed at determining minimum cost routes for a fleet of homogeneous vehicles in order to redistribute bikes among stations. [@DellAmico_Iori_Novellani_Subramanian_2018]

Zhou analyzed massive bike-sharing data in Chicago, constructing a bike flow similarity graph and using a fast-greedy algorithm to detect spatial communities of biking flows. He examined two questions: (1) How do bike flow patterns vary as a result of time, weekday or weekend, and user groups? (2) Given the flow patterns, what was the spatiotemporal distribution of the over-demand for bikes and docks in 2013 and 2014? [@Zhou_2015]

Hosford et al. surveyed participants in Vancouver, Canada and determined that public bicycle share programs are not used equally by all segments of the population. In many cities, program members tend to be male, Caucasian, employed, and have higher levels of education and income compared to the general population. Further, their study determined that the majority of bicycle share trips replace trips previously made by walking or public transit, indicating that bicycle share appeals to people who already use active and sustainable modes of transportation. [@Hosford_Lear_Fuller_Teschke_Therrien_Winters_2018]

In another paper, Hosford et al. determined that the implementation of the public bicycle share program in Vancouver was associated with greater increases in bicycling among those living and working inside the bicycle share service area, relative to those outside it, in the early phase of implementation, but this effect was not sustained over time. [@Hosford_Fuller_Lear_Teschke_Gauvin_Brauer_Winters_2018]

Schmidt observed that the number of bike-sharing programs worldwide grew from 5 in 2005 to 1,571 in 2018. He further noted that disparities in bike-sharing usage are evident around the country, with users skewing towards younger white men. [@Schmidt_2018]

Wang et al. examined the rebalancing problem and determined that the fluctuation of the available bikes and docks is not only caused by the user but also by the operators’ own (inefficient) rebalancing activities; they propose a data-driven model to generate an optimal rebalancing model while minimizing the cost of moving the bikes. [@Wang_He_Zhang_Shu_Liu_Gu_Liu_Lee_Son_2018]

Vogel and Mattfeld observe that short rental times and one-way use lead to imbalances in the spatial distribution of bikes at stations over time, and present a case study demonstrating that data mining applied to operational data offers insight into typical usage patterns of bike-sharing systems and can be used to forecast bike demand, with the aim of supporting and improving strategic and operational planning. They analyze both operational data from Vienna’s shared bike rental system and local weather data over the same period. [@Vogel_Mattfeld_2011]

Fuller et al. examined the impact of a public transit strike (November 2016 in Philadelphia) on usage of the bike share service in that city. [@Fuller_Luan_Buote_Auchincloss_2019]

In an earlier study, Fuller et al. examined bikeshare in Montreal by collecting samples prior to the launch of the program, and following each of the first two seasons. [Unlike other cities such as New York, the Montreal bike share system does not operate year-round. Rather, because of the especially harsh winters, their bikeshare system is dismantled each fall and reinstalled each spring.] Fuller’s methodology incorporated a 5-step logistic regression in which the weather variables entered at step 4; this rendered nonsignificant the differences between the three survey periods. [@Fuller_Gauvin_Kestens_Daniel_Fournier_Morency_Drouin_2013]

Faghih-Imani and Eluru study the decision process involved in identifying destination locations after picking up a bicycle at a BSS station. Traditional destination/location choice approaches implicitly assume that the influence of exogenous factors on destination preferences is constant across the entire population. They propose a finite mixture multinomial logit (FMMNL) model that accommodates such heterogeneity by probabilistically assigning trips to different segments and estimating segment-specific destination choice models for each. Unlike the traditional destination-choice-based multinomial logit (MNL) or mixed multinomial logit (MMNL) models, an FMMNL model can consider the effect of fixed attributes across destinations, such as users’ or origins’ attributes, in the decision process. [@Faghih-Imani_Eluru_2018]

An et al. examine weather and cycling in New York City and find that weather impacts cycling rates more than topography, infrastructure, land use mix, calendar events, and peaks. They do so by exploring a series of interaction effects, each of which captures the extent to which two characteristics occurring simultaneously exert a combined effect on cycling ridership – e.g., how is cycling impacted when it is both wet and a weekend day, or humid in the hilliest parts of the cycling network? [@An_Zahnow_Pojani_Corcoran_2019]

Heaney et al. examine the relation between ambient temperature and bikeshare usage, and project how climate-change-induced increases in ambient temperature may influence active transportation in New York City. [@Heaney_Carrión_Burkart_Lesk_Jack_2019]

In the 1990s, Nankervis examined the effect of weather and climate on university student bicycle commuting patterns in Melbourne, Australia by examining counts of parked bicycles at local universities and correlating with the weather for each day, finding that the deterrent effect of bad weather on commuting was less than commonly believed (though still significant). [@Nankervis_1999]

4. Data Sources

Data sources and uploading

We obtained data from two sources:

1. CitiBike trip dataset

CitiBike makes a vast amount of data available regarding system usage as well as sales of memberships and short-term passes.

For each month since the system’s inception, there is a file containing details of (almost) every trip. (Certain “trips” are omitted from the dataset. For example, if a user checks out a bike from a dock but then returns it within one minute, the system drops such a “trip” from the listing, as such “trips” are not interesting.)

There are currently 77 monthly data files for the New York City bikeshare system, spanning July 2013 through November 2019. Each file contains a line for every trip. The number of trips per month varies from as few as 200,000 during winter months in the system’s early days to more than 2 million during recent summer months. The total number of entries exceeds 90 million, comprising 17 GB of data. Because of the computational limitations this presented, we created samples of 1/1000 and 1/100 of the data. The samples were created deterministically, by subsetting the files on every 1000th (or 100th) row.
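The deterministic 1/k subsetting can be sketched as follows (shown in Python/pandas for illustration; the function name is ours):

```python
import pandas as pd

def deterministic_sample(df: pd.DataFrame, k: int = 1000) -> pd.DataFrame:
    """Keep every k-th row (rows 0, k, 2k, ...), i.e. a deterministic 1/k sample."""
    return df.iloc[::k].reset_index(drop=True)
```

With k = 1000, roughly 90 million trips reduce to roughly 90,000 sampled rows, on the order of the sample sizes used here.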

2. Central Park daily weather data

We also obtained historical weather information for 2013-2019 from the NCDC (National Climatic Data Center) by submitting an online request at https://www.ncdc.noaa.gov/cdo-web/search. Although the weather may vary slightly within New York City, we opted to use the Central Park observations as a proxy for the entire city’s weather.

We believe that the above data provides a reasonable representation of the target population (all CitiBike rides) and the citywide weather.

load(file='DATA/CB.RData')
city_bike_df = as.data.frame(CB)
head(city_bike_df)
##   trip_duration              s_time              e_time s_station_id          s_station_name    s_lat    s_long
## 1           634 2013-07-01 00:00:00 2013-07-01 00:10:34          164         E 47 St & 2 Ave 40.75323 -73.97033
## 2           437 2013-07-01 06:54:02 2013-07-01 07:01:19          479         9 Ave & W 45 St 40.76019 -73.99126
## 3          1398 2013-07-01 08:03:38 2013-07-01 08:26:56          157 Henry St & Atlantic Ave 40.69089 -73.99612
## 4          1124 2013-07-01 08:37:40 2013-07-01 08:56:24          496         E 16 St & 5 Ave 40.73726 -73.99239
## 5          1199 2013-07-01 09:16:59 2013-07-01 09:36:58          432       E 7 St & Avenue A 40.72622 -73.98380
## 6           221 2013-07-01 11:50:21 2013-07-01 11:54:02          475     E 16 St & Irving Pl 40.73524 -73.98759
##   e_station_id          e_station_name    e_lat    e_long bike_id  user_type birth_year gender
## 1          504         1 Ave & E 15 St 40.73222 -73.98166   16950   Customer         NA      0
## 2          243 Fulton St & Rockwell Pl 40.68798 -73.97847   16151 Subscriber       1987      1
## 3          375 Mercer St & Bleecker St 40.72679 -73.99695   15997 Subscriber       1987      1
## 4          500      Broadway & W 51 St 40.76229 -73.98336   17750 Subscriber       1959      2
## 5          466         W 25 St & 6 Ave 40.74395 -73.99145   17671 Subscriber       1983      2
## 6          537 Lexington Ave & E 24 St 40.74026 -73.98409   16490 Subscriber       1956      1
nrow(city_bike_df)
## [1] 92565
ncol(city_bike_df)
## [1] 15

Weather Data

# Weather data is obtained from the  NCDC (National Climatic Data Center) via https://www.ncdc.noaa.gov/cdo-web/
# click on search tool  https://www.ncdc.noaa.gov/cdo-web/search
# select "daily summaries"
# select Search for Stations
# Enter Search Term "USW00094728" for Central Park Station: 
# https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094728/detail
# "add to cart"


weatherfilenames=list.files(path="./", pattern = '\\.csv$', full.names = T)    # files ending with .csv (escaped dot); not .zip
#weatherfilenames
weatherfile <- "DATA/NYC_Weather_Data_2013-2019.csv"

## Perhaps we should rename the columns to more clearly reflect their meaning?
weatherspec <- cols(
  STATION = col_character(),
  NAME = col_character(),
  LATITUDE = col_double(),
  LONGITUDE = col_double(),
  ELEVATION = col_double(),
  DATE = col_date(format = "%F"),          #  "%F" = "%Y-%m-%d"; note that some NCDC extracts
  #DATE = col_date(format = "%m/%d/%Y"),   #  use "%m/%d/%Y" instead, in which case unparsed dates appear as NA
  AWND = col_double(),                     # Average Daily Wind Speed
  AWND_ATTRIBUTES = col_character(),
  PGTM = col_double(),                    # Peak Wind-Gust Time
  PGTM_ATTRIBUTES = col_character(),
  PRCP = col_double(),                    # Amount of Precipitation
  PRCP_ATTRIBUTES = col_character(),
  SNOW = col_double(),                    # Amount of Snowfall
  SNOW_ATTRIBUTES = col_character(),
  SNWD = col_double(),                    # Depth of snow on the ground
  SNWD_ATTRIBUTES = col_character(),
  TAVG = col_double(),                    # Average Temperature (not populated)
  TAVG_ATTRIBUTES = col_character(),
  TMAX = col_double(),                    # Maximum temperature for the day
  TMAX_ATTRIBUTES = col_character(),
  TMIN = col_double(),                    # Minimum temperature for the day
  TMIN_ATTRIBUTES = col_character(),
  TSUN = col_double(),                    # Daily Total Sunshine (not populated)
  TSUN_ATTRIBUTES = col_character(),
  WDF2 = col_double(),                    # Direction of fastest 2-minute wind
  WDF2_ATTRIBUTES = col_character(),
  WDF5 = col_double(),                    # Direction of fastest 5-second wind
  WDF5_ATTRIBUTES = col_character(),
  WSF2 = col_double(),                    # Fastest 2-minute wind speed
  WSF2_ATTRIBUTES = col_character(),
  WSF5 = col_double(),                    # fastest 5-second wind speed
  WSF5_ATTRIBUTES = col_character(),
  WT01 = col_double(),                    # Fog
  WT01_ATTRIBUTES = col_character(),
  WT02 = col_double(),                    # Heavy Fog
  WT02_ATTRIBUTES = col_character(),
  WT03 = col_double(),                    # Thunder
  WT03_ATTRIBUTES = col_character(),
  WT04 = col_double(),                    # Sleet
  WT04_ATTRIBUTES = col_character(),
  WT06 = col_double(),                    # Glaze
  WT06_ATTRIBUTES = col_character(),
  WT08 = col_double(),                    # Smoke or haze
  WT08_ATTRIBUTES = col_character(),
  WT13 = col_double(),                    # Mist
  WT13_ATTRIBUTES = col_character(),
  WT14 = col_double(),                    # Drizzle
  WT14_ATTRIBUTES = col_character(),
  WT16 = col_double(),                    # Rain
  WT16_ATTRIBUTES = col_character(),
  WT18 = col_double(),                    # Snow      
  WT18_ATTRIBUTES = col_character(),
  WT19 = col_double(),                    # Unknown source of precipitation
  WT19_ATTRIBUTES = col_character(),
  WT22 = col_double(),                    # Ice fog
  WT22_ATTRIBUTES = col_character()
)

# load all the daily weather data
weather <- read_csv(weatherfile, col_types = weatherspec)
weather_df1 = as.data.frame(weather)

# Check the number of rows and columns in weather data frame
nrow(weather_df1)
## [1] 2541
ncol(weather_df1)
## [1] 56
# Select only those columns that are useful for our analysis
weather_df = select(weather_df1, STATION, NAME, DATE, AWND, PRCP, SNOW, SNWD, TMAX, TMIN, WDF2, WDF5, WSF2, WSF5, WT01)

# Check how many columns have empty values
sapply(weather_df, function(x) sum(is.na(x)))
## STATION    NAME    DATE    AWND    PRCP    SNOW    SNWD    TMAX    TMIN    WDF2    WDF5    WSF2    WSF5    WT01 
##       0       0    2541     167       0       1       0       0       0     164     180     164     180    1696
# Perform data imputation on weather_df: replace missing values with the column means
weather_df$AWND[is.na(weather_df$AWND)] = mean(weather_df1$AWND, na.rm=TRUE)
weather_df$SNOW[is.na(weather_df$SNOW)] = mean(weather_df1$SNOW, na.rm=TRUE)
weather_df$WDF2[is.na(weather_df$WDF2)] = mean(weather_df1$WDF2, na.rm=TRUE)
weather_df$WDF5[is.na(weather_df$WDF5)] = mean(weather_df1$WDF5, na.rm=TRUE)
weather_df$WSF2[is.na(weather_df$WSF2)] = mean(weather_df1$WSF2, na.rm=TRUE)
weather_df$WSF5[is.na(weather_df$WSF5)] = mean(weather_df1$WSF5, na.rm=TRUE)

# Check again for missing values after imputation (DATE and WT01 were not imputed)
sapply(weather_df, function(x) sum(is.na(x)))
## STATION    NAME    DATE    AWND    PRCP    SNOW    SNWD    TMAX    TMIN    WDF2    WDF5    WSF2    WSF5    WT01 
##       0       0    2541       0       0       0       0       0       0       0       0       0       0    1696
# City bike data number of rows and columns
c(nrow(city_bike_df), ncol(city_bike_df))
## [1] 92565    15
# Weather data number of rows and columns
c(nrow(weather_df), ncol(weather_df))
## [1] 2541   14
# Check the column names of city_bike_df and weather_df
colnames(city_bike_df)
##  [1] "trip_duration"  "s_time"         "e_time"         "s_station_id"   "s_station_name" "s_lat"         
##  [7] "s_long"         "e_station_id"   "e_station_name" "e_lat"          "e_long"         "bike_id"       
## [13] "user_type"      "birth_year"     "gender"
colnames(weather_df)
##  [1] "STATION" "NAME"    "DATE"    "AWND"    "PRCP"    "SNOW"    "SNWD"    "TMAX"    "TMIN"    "WDF2"    "WDF5"   
## [12] "WSF2"    "WSF5"    "WT01"
# Display head of city_bike_df and weather_df
head(city_bike_df)
##   trip_duration              s_time              e_time s_station_id          s_station_name    s_lat    s_long
## 1           634 2013-07-01 00:00:00 2013-07-01 00:10:34          164         E 47 St & 2 Ave 40.75323 -73.97033
## 2           437 2013-07-01 06:54:02 2013-07-01 07:01:19          479         9 Ave & W 45 St 40.76019 -73.99126
## 3          1398 2013-07-01 08:03:38 2013-07-01 08:26:56          157 Henry St & Atlantic Ave 40.69089 -73.99612
## 4          1124 2013-07-01 08:37:40 2013-07-01 08:56:24          496         E 16 St & 5 Ave 40.73726 -73.99239
## 5          1199 2013-07-01 09:16:59 2013-07-01 09:36:58          432       E 7 St & Avenue A 40.72622 -73.98380
## 6           221 2013-07-01 11:50:21 2013-07-01 11:54:02          475     E 16 St & Irving Pl 40.73524 -73.98759
##   e_station_id          e_station_name    e_lat    e_long bike_id  user_type birth_year gender
## 1          504         1 Ave & E 15 St 40.73222 -73.98166   16950   Customer         NA      0
## 2          243 Fulton St & Rockwell Pl 40.68798 -73.97847   16151 Subscriber       1987      1
## 3          375 Mercer St & Bleecker St 40.72679 -73.99695   15997 Subscriber       1987      1
## 4          500      Broadway & W 51 St 40.76229 -73.98336   17750 Subscriber       1959      2
## 5          466         W 25 St & 6 Ave 40.74395 -73.99145   17671 Subscriber       1983      2
## 6          537 Lexington Ave & E 24 St 40.74026 -73.98409   16490 Subscriber       1956      1
head(weather_df)
##       STATION                        NAME DATE AWND PRCP SNOW SNWD TMAX TMIN WDF2 WDF5 WSF2 WSF5 WT01
## 1 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 6.93    0    0    0   40   26  310  300 15.0 25.9   NA
## 2 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 5.82    0    0    0   33   22  310  340 15.0 21.9   NA
## 3 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 4.47    0    0    0   32   24  260  260 13.0 19.9   NA
## 4 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 8.05    0    0    0   37   30  290  250 17.9 28.0   NA
## 5 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 6.71    0    0    0   42   32  310  310 17.0 25.9   NA
## 6 USW00094728 NY CITY CENTRAL PARK, NY US <NA> 6.71    0    0    0   46   34  290  270 13.0 19.9   NA

5. Exploratory Data Analysis

Summary of trip durations BEFORE truncation:

                         Min.      1st Qu.       Median         Mean      3rd Qu.          Max.
supplied_secs      61.0000000  373.0000000  618.0000000  906.8727165 1061.0000000  1688083.0000
calc_secs          60.0000000  373.7270000  618.6000001  907.3520470 1061.9250000  1688083.0000
calc_mins           1.0000000    6.2287833   10.3100000   15.1225341   17.6987500    28134.7167
calc_hours          0.0166667    0.1038131    0.1718333    0.2520422    0.2949792      468.9119
calc_days           0.0006944    0.0043255    0.0071597    0.0105018    0.0122908       19.5380

The above indicates that the trip durations (in seconds) include values in the millions – which likely reflect trips that failed to be properly closed out.

Delete cases with unreasonable trip_duration values

Let’s assume that nobody would rent a bicycle for longer than a specified time limit (say, 3 hours), and drop any records which exceed it:

## [1] "Removed 158 trips (0.171%) of longer than 3 hours."
## [1] "Remaining number of trips: 92407"

Examine birth_year

Other inconsistencies concern the collection of birth_year, from which we can infer the age of the participant. There are some months in which this value is omitted, while there are other months in which all values are populated. However, there are a few records which suggest that the rider is a centenarian – it seems highly implausible that someone born in the 1880s is cycling around Central Park – but the data does have such anomalies. Thus, a substantial amount of time was needed for detecting and cleaning such inconsistencies.

The birth year for some users is as old as 1885, which is not possible:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1885    1969    1981    1978    1988    2003    5828

Remove trips associated with very old users (age>90)

(Also: remove trips associated with missing birth_year)

## [1] "Removed 41 trips (0.044%) of users older than 90 years."
## [1] "Removed 5828 trips (6.296%) of users where age is unknown (birth_year unspecified)."
## [1] "Remaining number of trips: 86538"

Compute distance between start and end stations

This is straight-line distance between (longitude,latitude) points – it doesn’t incorporate an actual bicycle route.
There are services (e.g., from Google) which can compute and measure a recommended bicycle route between points, but use of such services requires a subscription and incurs a cost.
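One common way to compute this straight-line (great-circle) distance is the haversine formula; a minimal sketch in Python (the function name and Earth-radius constant are ours):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points."""
    R = 6371.0  # mean Earth radius, km
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * R * asin(sqrt(a))
```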

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.5885  1.0477  1.3090  1.7280 10.6603

In this subset of the data, the maximum distance between stations is 10.66 km. There are some stations in the data for which the latitude and longitude are recorded as zero, which makes the computed distance between such a station and an actual station many thousands of kilometers. If such items exist, we will delete them:

Delete unusually long distances

## [1] "No unusually long distances were found in this subset of the data."

Compute usage fee

There is a time-based usage fee for rides longer than an initial period:

  • For user_type=Subscriber, the fee is $2.50 per 15 minutes following an initial free 45 minutes per ride.
  • For user_type=Customer, the fee is $4.00 per 15 minutes following an initial free 30 minutes per ride.
  • There are some cases where the user type is not specified (we have relabeled these as “UNKNOWN”), and we do not estimate usage fees for such trips.
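The fee schedule above can be expressed directly in code. A sketch in Python; it assumes partial 15-minute blocks are billed as full blocks, which the schedule above does not spell out:

```python
import math
from typing import Optional

FEE_RULES = {  # (free minutes, fee per started 15-minute block), per the schedule above
    "Subscriber": (45, 2.50),
    "Customer": (30, 4.00),
}

def usage_fee(user_type: str, duration_min: float) -> Optional[float]:
    """Time-based usage fee in dollars; None for unknown user types (no fee estimated)."""
    if user_type not in FEE_RULES:
        return None
    free_min, fee_per_block = FEE_RULES[user_type]
    extra = max(0.0, duration_min - free_min)
    return math.ceil(extra / 15) * fee_per_block
```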

#### Summary of trip durations AFTER censoring/truncation:

#express trip duration in seconds, minutes, hours, days
# note: we needed to fix the November daylight savings problem to eliminate negative trip times

#### Supplied seconds
#print("Supplied Seconds:")
supplied_secs<-summary(CB$trip_duration)

#### Seconds
CB$trip_duration_s = as.numeric(CB$e_time - CB$s_time,"secs")
calc_secs<-summary(CB$trip_duration_s)

#### Minutes
CB$trip_duration_m = as.numeric(CB$e_time - CB$s_time,"mins")
calc_mins<-summary(CB$trip_duration_m)

#### Hours
CB$trip_duration_h = as.numeric(CB$e_time - CB$s_time,"hours")
calc_hours<-summary(CB$trip_duration_h)

#### Days
CB$trip_duration_d = as.numeric(CB$e_time - CB$s_time,"days")
calc_days <-summary(CB$trip_duration_d)

# library(kableExtra) # loaded above
rbind(supplied_secs, calc_secs, calc_mins, calc_hours, calc_days) %>% 
  kable(caption = "Summary of trip durations - AFTER truncations:") %>%
  kable_styling(c("bordered","striped"),latex_options =  "hold_position")
Summary of trip durations - AFTER truncation:

                        Min.      1st Qu.       Median         Mean      3rd Qu.          Max.
supplied_secs     61.0000000  363.0000000  592.0000000  775.7573898  994.0000000 10617.0000000
calc_secs         60.0000000  363.0000000  593.0000000  776.2454997  994.7922499 10617.5160000
calc_mins          1.0000000    6.0500000    9.8833333   12.9374250   16.5798708   176.9586000
calc_hours         0.0166667    0.1008333    0.1647222    0.2156237    0.2763312     2.9493100
calc_days          0.0006944    0.0042014    0.0068634    0.0089843    0.0115138     0.1228879

We could have chosen to censor the data, in which case we would not drop observations, but would instead move them to a limiting value, such as three hours (for trip time) or an age of 90 years (for adjusting birth_year).
As there were few such cases, we instead decided to truncate the data by dropping such observations from the dataset.
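The distinction between the two treatments can be illustrated in a few lines (hypothetical duration values, in hours):

```python
import numpy as np

durations_h = np.array([0.5, 1.2, 2.9, 7.0, 468.9])  # hypothetical trip durations (hours)
LIMIT_H = 3.0

truncated = durations_h[durations_h <= LIMIT_H]  # truncation: drop the extreme observations
censored = np.minimum(durations_h, LIMIT_H)      # censoring: cap them at the limit instead
```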

Limitations and Challenges in uploading and analyzing this data

Data Size

Because there is so much data, it is difficult to analyze the entire universe of trip-by-trip data unless one has high-performance computational resources.

Data formatting inconsistencies from month to month:
  • Data column names change slightly from month to month.
  • In some months, CitiBike specifies dates as YYYY-MM-DD, while in other months, dates are MM/DD/YYYY .
  • In certain months, the timestamps include HH:MM:SS (as well as fractional seconds) while in other months, timestamps only include HH:MM , as seconds are omitted entirely.
  • We encountered an unusual quirk which manifests itself just once a year, on the first Sunday of November, when clocks are rolled back an hour as Daylight Saving Time changes to Standard Time:
    • The files do not specify whether a timestamp is EDT or EST. On any other date, this is not a problem, but the hour of 1am-2am EDT on that November Sunday is followed by an hour 1am-2am EST.
    • If someone rents a bike at, say, 1:55am EDT (before the time change) and then returns it 15 minutes later, the time is now 1:10am (EST).
    • The difference in timestamps suggests that the rental lasted negative 45 minutes, which is of course impossible!
  • Sometimes there is an unusually long interval between the start time of a bicycle rental and the time at which the system registers such rental as having concluded.
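One way to repair the fall-back anomaly, assuming a negative naive-local difference can only arise from the repeated 1am-2am hour, is to add that hour back (sketch in Python; the function name is ours):

```python
from datetime import datetime, timedelta

def repaired_duration(start: datetime, end: datetime) -> timedelta:
    """Trip duration from naive local timestamps; a negative delta gets the repeated DST hour added back."""
    delta = end - start
    if delta < timedelta(0):
        delta += timedelta(hours=1)  # end time fell in the repeated 1am-2am EST hour
    return delta

# checkout at 1:55am EDT, return 15 minutes later at 1:10am EST on the fall-back Sunday
d = repaired_duration(datetime(2013, 11, 3, 1, 55), datetime(2013, 11, 3, 1, 10))
```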

Correlations of individual trip data features

We can examine the correlations between variables to understand their relationships, and also to remain alert to potential problems of multicollinearity. We compute both Pearson (linear) and Spearman (rank) correlations between key variables on the individual CitiBike trip data. (Later we will compute correlations on daily aggregated data which has been joined with the daily weather observations.)
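The distinction matters because trip data contains extreme outliers; Spearman, being rank-based, is robust to them. A small illustration (pandas; the toy values are ours):

```python
import pandas as pd

df = pd.DataFrame({
    "trip_duration": [300, 600, 900, 1200, 30000],  # seconds, with one extreme outlier
    "distance_km": [0.5, 1.0, 1.5, 2.0, 2.2],
})
# Spearman sees a perfectly monotone relationship; Pearson is dragged down by the outlier
pearson = df["trip_duration"].corr(df["distance_km"], method="pearson")
spearman = df["trip_duration"].corr(df["distance_km"], method="spearman")
```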

Aggregate and join

Aggregate individual CitiBike trip data by day, and join to daily weather data

We will perform our calculations on an aggregated basis. We will group each day’s rides together, but we will segment by user_type (“Subscriber” or “Customer”) and by gender (“Male” or “Female”). For each of these segments, there are some cases where the user_type is not specified, so we have designated that as “Unknown.” For gender, there are cases where the CitiBike data set contains a zero, which indicates that the gender of the user was not recorded.

For each day, we will aggregate the following items across each of the above groupings:

  • mean trip_duration
  • median trip_duration
  • sum of distance_km
  • sum of trip_fee
  • mean of age
  • count of number of trips on that day

We will split the aggregated data into a training dataset, consisting of all (grouped, daily) aggregations from 2013-2018, and a test dataset, consisting of (grouped, daily) aggregations from 2019.

We will then join each aggregated CitiBike data element with the corresponding weather observation for that date.
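The aggregate-and-join step can be sketched as follows (Python/pandas for illustration; the column names are illustrative, not the exact ones used in our R code):

```python
import pandas as pd

trips = pd.DataFrame({
    "date": pd.to_datetime(["2013-07-01", "2013-07-01", "2013-07-02"]),
    "user_type": ["Subscriber", "Customer", "Subscriber"],
    "gender": ["Male", "Unknown", "Female"],
    "trip_duration": [634, 437, 1398],
    "distance_km": [1.2, 0.8, 2.4],
})
weather = pd.DataFrame({
    "DATE": pd.to_datetime(["2013-07-01", "2013-07-02"]),
    "TMAX": [88, 90],
    "PRCP": [0.0, 0.1],
})

# one row per (day, user_type, gender) segment
daily = (trips.groupby(["date", "user_type", "gender"], as_index=False)
              .agg(mean_duration=("trip_duration", "mean"),
                   median_duration=("trip_duration", "median"),
                   sum_distance_km=("distance_km", "sum"),
                   n_trips=("trip_duration", "size")))

# attach that day's weather observation to every segment row
joined = daily.merge(weather, left_on="date", right_on="DATE", how="left")
```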

There are 5202 rows of daily aggregated data in the training dataset, and 1870 rows in the corresponding test dataset.

6. Methodology

6.1 Descriptive Analytics

6.1.1 Analyze data distribution and data skewness

6.1.2 Analyze feature correlations

6.1.3 Analyze feature importance

6.1.4 Analyze timeseries decomposition plots

6.1.5 Seasonal and trend timeseries analysis

6.1.6 Analyze autocorrelation and partial autocorrelation

6.2 Data Correction

6.2.1 Data imputation

6.2.2 Outlier removal

6.3 Feature Engineering

6.3.1 Create important features from date

6.3.2 Create lagged features

6.3.3 Create windowing features

6.3.4 Create important features from weather data

6.3.5 Apply timeseries smoothing functions

6.3.6 Apply various feature transformation techniques (Box-Cox, log, etc.)
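The lagged and windowing features in the outline above can be sketched as (pandas; the series values are illustrative):

```python
import pandas as pd

daily_trips = pd.Series([100, 120, 90, 150, 160, 140, 155],
                        index=pd.date_range("2019-06-01", periods=7, freq="D"))

features = pd.DataFrame({
    "trips": daily_trips,
    "lag_1": daily_trips.shift(1),                 # yesterday's trip count
    "lag_7": daily_trips.shift(7),                 # same weekday, previous week
    "roll_mean_3": daily_trips.rolling(3).mean(),  # 3-day moving-average smoothing
})
```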

6.4 Model Building

6.4.1 Build Multiple Regression model and Random Forest with Train and Test Datasets

## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
## parsnip model object
## 
## Fit time:  13ms 
## 
## Call:
## stats::lm(formula = formula, data = data)
## 
## Coefficients:
##         (Intercept)                 DATE                 AWND                 TMAX                 TMIN  
##          -7.7267141            0.0005903            0.0073968           -0.0029396            0.0010834  
##                WDF2                 WDF5                 WSF2                 WSF5  user_typeSubscriber  
##           0.0002639            0.0001031           -0.0229798            0.0029510            1.5414800  
##    user_typeUNKNOWN                train         genderFemale        genderUNKNOWN         sum_duration  
##          -0.0673986                   NA           -2.1778470           -2.7037704            0.0004949  
##     median_duration      sum_distance_km              avg_age  
##          -0.0014653            0.3946859            0.0099778
## Random Forest Model Specification (regression)
## 
## Computational engine: ranger
## 
##  iter imp variable
##   1   1  AWND  WDF2  WDF5  WSF2  WSF5
##   1   2  AWND  WDF2  WDF5  WSF2  WSF5
##   1   3  AWND  WDF2  WDF5  WSF2  WSF5
##   1   4  AWND  WDF2  WDF5  WSF2  WSF5
##   1   5  AWND  WDF2  WDF5  WSF2  WSF5
##   ... (iterations 2 through 50 omitted; each repeats the same five imputed variables: AWND, WDF2, WDF5, WSF2, WSF5)
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##            DATE            AWND            TMAX            TMIN            WDF2            WDF5            WSF2 
##              ""           "pmm"              ""              ""           "pmm"           "pmm"           "pmm" 
##            WSF5       user_type           train          gender    sum_duration median_duration sum_distance_km 
##           "pmm"              ""              ""              ""              ""              ""              "" 
##         avg_age           trips 
##              ""              "" 
## PredictorMatrix:
##      DATE AWND TMAX TMIN WDF2 WDF5 WSF2 WSF5 user_type train gender sum_duration median_duration sum_distance_km
## DATE    0    1    1    1    1    1    1    1         1     0      1            1               1               1
## AWND    1    0    1    1    1    1    1    1         1     0      1            1               1               1
## TMAX    1    1    0    1    1    1    1    1         1     0      1            1               1               1
## TMIN    1    1    1    0    1    1    1    1         1     0      1            1               1               1
## WDF2    1    1    1    1    0    1    1    1         1     0      1            1               1               1
## WDF5    1    1    1    1    1    0    1    1         1     0      1            1               1               1
##      avg_age trips
## DATE       1     1
## AWND       1     1
## TMAX       1     1
## TMIN       1     1
## WDF2       1     1
## WDF5       1     1
## Number of logged events:  1 
##   it im dep     meth   out
## 1  0  0     constant train
## 
##  iter imp variable
##   1   1  AWND  WDF2  WDF5  WSF2  WSF5
##   1   2  AWND  WDF2  WDF5  WSF2  WSF5
##   1   3  AWND  WDF2  WDF5  WSF2  WSF5
##   1   4  AWND  WDF2  WDF5  WSF2  WSF5
##   1   5  AWND  WDF2  WDF5  WSF2  WSF5
##   ... (iterations 2 through 50 omitted; each repeats the same five imputed variables: AWND, WDF2, WDF5, WSF2, WSF5)
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##            DATE            AWND            TMAX            TMIN            WDF2            WDF5            WSF2 
##              ""           "pmm"              ""              ""           "pmm"           "pmm"           "pmm" 
##            WSF5       user_type           train          gender    sum_duration median_duration sum_distance_km 
##           "pmm"              ""              ""              ""              ""              ""              "" 
##         avg_age           trips 
##              ""              "" 
## PredictorMatrix:
##      DATE AWND TMAX TMIN WDF2 WDF5 WSF2 WSF5 user_type train gender sum_duration median_duration sum_distance_km
## DATE    0    1    1    1    1    1    1    1         1     0      1            1               1               1
## AWND    1    0    1    1    1    1    1    1         1     0      1            1               1               1
## TMAX    1    1    0    1    1    1    1    1         1     0      1            1               1               1
## TMIN    1    1    1    0    1    1    1    1         1     0      1            1               1               1
## WDF2    1    1    1    1    0    1    1    1         1     0      1            1               1               1
## WDF5    1    1    1    1    1    0    1    1         1     0      1            1               1               1
##      avg_age trips
## DATE       1     1
## AWND       1     1
## TMAX       1     1
## TMIN       1     1
## WDF2       1     1
## WDF5       1     1
## Number of logged events:  1251 
##   it im  dep     meth              out
## 1  0  0      constant            train
## 2  1  1 AWND      pmm user_typeUNKNOWN
## 3  1  1 WDF2      pmm user_typeUNKNOWN
## 4  1  1 WDF5      pmm user_typeUNKNOWN
## 5  1  1 WSF2      pmm user_typeUNKNOWN
## 6  1  1 WSF5      pmm user_typeUNKNOWN
## parsnip model object
## 
## Fit time:  2.7s 
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      5202 
## Number of independent variables:  15 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       2.942535 
## R squared (OOB):                  0.9796524
## # A tibble: 2 x 4
##   model .metric .estimator .estimate
##   <chr> <chr>   <chr>          <dbl>
## 1 lm    rmse    standard       2.27 
## 2 rf    rmse    standard       0.778
## # A tibble: 2 x 4
##   model .metric .estimator .estimate
##   <chr> <chr>   <chr>          <dbl>
## 1 lm    rmse    standard        2.56
## 2 rf    rmse    standard        2.22

## # A tibble: 2 x 5
##   .metric .estimator  mean     n  std_err
##   <chr>   <chr>      <dbl> <int>    <dbl>
## 1 rmse    standard   1.70     10 0.0122  
## 2 rsq     standard   0.980    10 0.000510
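The `mids` summaries and iteration traces shown above come from the mice package. The following self-contained sketch shows how such an imputation is typically invoked; the toy data and seed are illustrative, while `m = 5`, `maxit = 50`, and predictive mean matching (`"pmm"`) mirror the settings visible in the output.

``` r
library(mice)

# Toy data with missing wind observations, standing in for AWND/WSF2 etc.
wind <- data.frame(
  AWND = c(4.5, NA, 6.1, 3.8, NA, 5.2, 4.9, 7.0),
  WSF2 = c(12, 14, NA, 10, 13, NA, 15, 18),
  TMAX = c(55, 60, 62, 48, 70, 75, 80, 85)
)

# Five imputations (m = 5) over 50 iterations, via predictive mean matching;
# printFlag = FALSE suppresses the long "iter imp variable" trace
imp <- mice(wind, m = 5, maxit = 50, method = "pmm",
            seed = 42, printFlag = FALSE)

# Extract the first completed (fully imputed) dataset
wind_complete <- complete(imp, 1)
```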

The training RMSE of the linear model is `r rmse_train$estimate[rmse_train$model == 'lm']`, versus `r rmse_train$estimate[rmse_train$model == 'rf']` for the random forest; the random forest performs much better than lm. We then resample the training set to produce a more reliable estimate of how the model will perform, which yields an RMSE of `r rmse_rf$mean[rmse_rf$.metric == 'rmse']`.

3.4.2 Build Regularized Regression Model (glmnet)

This model is built with the tidymodels package and incorporates holidays into the data preparation; the penalty hyperparameter is tuned via grid search against a validation dataset.
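The tuning setup just described can be sketched as follows. This is a self-contained toy version: the data and formula are made up, while the 80/20 validation split and the 30-value penalty grid between 1e-4 and 1e-1 mirror the output shown below.

``` r
library(tidymodels)

# Toy daily data; `trips` is the outcome
set.seed(1)
daily <- tibble(
  trips = rpois(200, 20),
  TMAX  = runif(200, 30, 95),
  AWND  = runif(200, 2, 15)
)

# 80/20 validation split
val_split <- validation_split(daily, prop = 0.8)

# Lasso-style glmnet with the penalty left to be tuned
glmnet_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# 30 candidate penalties between 1e-4 and 1e-1 on a log scale
penalty_grid <- grid_regular(penalty(range = c(-4, -1)), levels = 30)

# Evaluate each penalty on the validation set, scored by RMSE
glmnet_res <- tune_grid(
  glmnet_spec,
  trips ~ TMAX + AWND,
  resamples = val_split,
  grid      = penalty_grid,
  metrics   = metric_set(rmse)
)

# Keep the penalty with the lowest validation RMSE
best_penalty <- select_best(glmnet_res, metric = "rmse")
```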

## # Validation Set Split (0.8/0.2)  using stratification 
## # A tibble: 1 x 2
##   splits            id        
##   <named list>      <chr>     
## 1 <split [4.1K/1K]> validation
## # A tibble: 5 x 1
##    penalty
##      <dbl>
## 1 0.0001  
## 2 0.000127
## 3 0.000161
## 4 0.000204
## 5 0.000259
## # A tibble: 5 x 1
##   penalty
##     <dbl>
## 1  0.0386
## 2  0.0489
## 3  0.0621
## 4  0.0788
## 5  0.1

##         penalty .metric .estimator     mean n std_err
## 1  0.0001000000    rmse   standard 2.226662 1      NA
## 2  0.0001268961    rmse   standard 2.226662 1      NA
## 3  0.0001610262    rmse   standard 2.226662 1      NA
## 4  0.0002043360    rmse   standard 2.226662 1      NA
## 5  0.0002592944    rmse   standard 2.226662 1      NA
## 6  0.0003290345    rmse   standard 2.226662 1      NA
## 7  0.0004175319    rmse   standard 2.226662 1      NA
## 8  0.0005298317    rmse   standard 2.226662 1      NA
## 9  0.0006723358    rmse   standard 2.226662 1      NA
## 10 0.0008531679    rmse   standard 2.226662 1      NA
## 11 0.0010826367    rmse   standard 2.226662 1      NA
## 12 0.0013738238    rmse   standard 2.226662 1      NA
## 13 0.0017433288    rmse   standard 2.226662 1      NA
## 14 0.0022122163    rmse   standard 2.226662 1      NA
## 15 0.0028072162    rmse   standard 2.226662 1      NA
## 16 0.0035622479    rmse   standard 2.226662 1      NA
## 17 0.0045203537    rmse   standard 2.226662 1      NA
## 18 0.0057361525    rmse   standard 2.226662 1      NA
## 19 0.0072789538    rmse   standard 2.226662 1      NA
## 20 0.0092367086    rmse   standard 2.226662 1      NA
## 21 0.0117210230    rmse   standard 2.226651 1      NA
## 22 0.0148735211    rmse   standard 2.226492 1      NA
## 23 0.0188739182    rmse   standard 2.226671 1      NA
## 24 0.0239502662    rmse   standard 2.227317 1      NA
## 25 0.0303919538    rmse   standard 2.228833 1      NA
## 26 0.0385662042    rmse   standard 2.231191 1      NA
## 27 0.0489390092    rmse   standard 2.235415 1      NA
## 28 0.0621016942    rmse   standard 2.239631 1      NA
## 29 0.0788046282    rmse   standard 2.246045 1      NA
## 30 0.1000000000    rmse   standard 2.257211 1      NA
## # A tibble: 1 x 6
##   penalty .metric .estimator  mean     n std_err
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
## 1  0.0149 rmse    standard    2.23     1      NA
## # A tibble: 1,031 x 6
##    id         .pred  .row penalty trips model 
##    <chr>      <dbl> <int>   <dbl> <int> <chr> 
##  1 validation  1.56     5  0.0149     2 glmnet
##  2 validation  4.96    14  0.0149     4 glmnet
##  3 validation  5.91    17  0.0149     7 glmnet
##  4 validation  7.07    24  0.0149     5 glmnet
##  5 validation  7.72    32  0.0149     6 glmnet
##  6 validation 19.8     33  0.0149    18 glmnet
##  7 validation 15.0     35  0.0149    18 glmnet
##  8 validation 26.9     37  0.0149    27 glmnet
##  9 validation  7.35    42  0.0149     8 glmnet
## 10 validation 10.3     43  0.0149     9 glmnet
## # … with 1,021 more rows

The glmnet model is then retrained with the grid-search penalty associated with the lowest RMSE, achieving a mean RMSE of `r lr_rmse$mean`.

3.4.3 Build Ensemble model Random Forest

## [1] 12
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   num.threads = cores
## 
## Computational engine: ranger
## Collection of 2 parameters for tuning
## 
##     id parameter type object class
##   mtry           mtry    nparam[?]
##  min_n          min_n    nparam[+]
## 
## Model parameters needing finalization:
##    # Randomly Selected Predictors ('mtry')
## 
## See `?dials::finalize` or `?dials::update.parameters` for more information.
## # A tibble: 1 x 7
##    mtry min_n .metric .estimator  mean     n std_err
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    14     8 rmse    standard    1.58     1      NA

## # A tibble: 1 x 2
##    mtry min_n
##   <int> <int>
## 1    14     8
## # A tibble: 25,775 x 6
##    id         .pred  .row  mtry min_n trips
##    <chr>      <dbl> <int> <int> <int> <int>
##  1 validation  2.22     5    13    30     2
##  2 validation  4.77    14    13    30     4
##  3 validation  6.74    17    13    30     7
##  4 validation  6.54    24    13    30     5
##  5 validation  7.28    32    13    30     6
##  6 validation 19.2     33    13    30    18
##  7 validation 15.7     35    13    30    18
##  8 validation 27.1     37    13    30    27
##  9 validation  8.26    42    13    30     8
## 10 validation  9.68    43    13    30     9
## # … with 25,765 more rows

3.4.3.1 Compare glmnet and random forest models

The random forest model achieves a better RMSE than the glmnet model: `r top_rf$mean` versus `r top_glmnet$mean`.

3.4.3.2 Build random forest model test

The random forest model is rebuilt with the best grid-search parameters and evaluated on the held-out test dataset; the important variables are also listed.
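This final step can be sketched with `last_fit()`, which trains on the training portion of a split and evaluates exactly once on the held-out test portion. The toy data below is illustrative (the real run used the tuned values mtry = 14 and min_n = 8); `importance = "impurity"` is what makes variable importances available for listing.

``` r
library(tidymodels)

# Toy daily data; `trips` is the outcome
set.seed(1)
daily <- tibble(
  trips = rpois(200, 20),
  TMAX  = runif(200, 30, 95),
  AWND  = runif(200, 2, 15)
)

# 75/25 train/test split, mirroring the Monte Carlo split above
data_split <- initial_split(daily, prop = 0.75)

# Workflow with a random forest using fixed (here: toy) tuned parameters
rf_wf <- workflow() %>%
  add_formula(trips ~ TMAX + AWND) %>%
  add_model(
    rand_forest(mode = "regression", mtry = 2, min_n = 8, trees = 500) %>%
      set_engine("ranger", importance = "impurity")
  )

# Train on the training portion, evaluate once on the held-out test portion
final_res <- last_fit(rf_wf, split = data_split)
collect_metrics(final_res)   # test-set rmse and rsq
```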

## # Monte Carlo cross-validation (0.75/0.25) with 1 resamples  
## # A tibble: 1 x 6
##   splits              id               .metrics         .notes           .predictions         .workflow 
##   <list>              <chr>            <list>           <list>           <list>               <list>    
## 1 <split [5.2K/1.7K]> train/test split <tibble [2 × 3]> <tibble [0 × 1]> <tibble [1,721 × 3]> <workflow>

The model test results show that the training and testing RMSEs are very close: `r top_rf$mean` and `r last_rmse$.estimate[last_rmse$.metric == 'rmse']`, which indicates the model predicts well on unseen data.

3.4.4 Build Ensemble model gradient boost

3.4.5 Build nonparametric KNN Regressor

3.4.6 Build nonparametric SVM Regressor

3.4.7 Build Timeseries Forecasting model ARIMA

3.4.8 Build Timeseries Forecasting model HoltWinters

3.4.9 Build Timeseries Forecasting model ETS

3.4.10 Build Ensemble of Timeseries Forecasting model

3.5 Optimization Functions

4. Results

5. Conclusion, Summary and Future Work

6. References